1. Introduction

Crime measurement and prediction is an important and complex topic in city management. Police departments want to explore the spatial distribution of crime using past crime data and then predict potential crime risk in the future. The prediction can be used to allocate personnel more "efficiently," deploying more, and probably more frequent, police resources in high-risk areas to combat crime. According to Logan (2008), one of the crime data collection methods in the USA is incident-based reporting. However, arguments concerning the accuracy and equity of reported crime data have emerged in recent years for the following reasons:

    1. Victims feel and act differently about the same criminal behavior. For example, some people will report an incident while others may choose to stay silent out of fear of retaliation by the criminals. Some people might also overstate the situation to increase police presence. This largely depends on the person's cultural background, identity, and where the crime happens.
    2. If an area is over-policed, for example some majority-Black neighborhoods, the number of observed and reported cases will rise. This can also vary by type of crime: reported burglaries might be more trustworthy than reported weapons violations, because the latter are more likely to be detected in over-policed areas and thus appear artificially denser there.

Therefore, if crime prediction is based on this kind of inaccurate data, it may result in inefficient police allocation and potentially biased policing.

According to Chicago's crime incident data, in 2017 the city had about 3,600 reported weapons violations concerning unlawful possession of a handgun. The distribution of the reported data shows that this type of crime is highly concentrated in several areas of the city. Without further investigation of the data, people might conclude that handgun violations are most likely to occur in those hot spots in the future, and that more regulation or patrols there are therefore reasonable. But is that true? Can we fully trust this data and the corresponding forecasts?

This project uses Chicago, a city that has long suffered from notorious crime, to explore the spatial distribution of crime and to develop a new crime prediction method based on spatial characteristic features that are more likely to portray where crime occurs, without relying on previously reported crimes (the current mainstream method). The model's ability to generalize across the city will also be evaluated at the end, focusing especially on its crime predictions in neighborhoods with different racial compositions.

setwd("~/Desktop/Upenn-Y1/2019Fall/590/HW7_Geo_risk_prediction")
library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(raster)
library(wesanderson)
#library(trendsceek)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

2. Background

2.1 Chicago Police Administration

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform("ESRI:102271") %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform("ESRI:102271") %>%
  dplyr::select(District = beat_num)

chicagoBoundary <- 
  st_read("/Users/jingzhang/Desktop/Upenn-Y1/2019Fall/590/HW7_Geo_risk_prediction/riskPrediction_data/chicagoBoundary.shp")%>%
  st_transform("ESRI:102271")

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

ggplot() +
  geom_sf(data = bothPoliceUnits) +
  facet_wrap(~Legend) +
  labs(title = "Police adminstrative areas, Chicago") +
  mapTheme()

#create fishnet 
fishnet <- 
  st_make_grid(chicagoBoundary, cellsize = 500) %>%
  st_sf()

# select only Chicago's net, add unique ID
fishnet <- 
  fishnet[chicagoBoundary,] %>%
  mutate(uniqueID = rownames(.)) %>%
  dplyr::select(uniqueID)

2.2 Weapon violations in Chicago


weapons<- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
   filter(Primary.Type == "WEAPONS VIOLATION" &
           Description == "UNLAWFUL POSS OF HANDGUN")%>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x, into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform("ESRI:102271") %>% 
  distinct()

crime_net <- 
  weapons %>% 
  dplyr::select() %>% 
  mutate(countWeapons = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countWeapons = ifelse(is.na(countWeapons), 0, countWeapons),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

This map visualizes the weapons violation (unlawful possession of a handgun) cases in Chicago in 2017. Notably, the cases are not evenly distributed across the city, and we can observe some clustering of handgun violations. For example, the map suggests that the north-central west part, as well as the south-central east part, has highly concentrated handgun violations, while they are almost completely absent in the far northwest and southeast of the city. Since some points overlap, some clues might be missed in this simple map, so let's visualize it in a more accurate way.

ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = weapons, colour="palevioletred1", size=0.02, show.legend = "point") +
  labs(title= "Handgun Violation in Chicago, 2017")+
  mapTheme()

Apart from plotting the incidents directly, we can use a fishnet base map with 500m × 500m grid cells as the basic unit and measure the number of cases within each cell. The following map presents a powerful and straightforward infographic for analyzing and understanding the volume of crime and its spatial distribution. What we observe in the fishnet map is generally consistent with the previous map: there are two obvious handgun violation hot spots in Chicago. The fishnet map further suggests that crime in the northwest hot spot is more concentrated, while the south-central one is more diffuse and even extends east to the coast.

ggplot() +
  geom_sf(data = crime_net, aes(fill = countWeapons), lwd=0.1) +
  scale_fill_viridis(option="cividis") +
  labs(title = "Count of Handgun Violation by Fishnet in Chicago, 2017")+
  mapTheme()

3. Feature engineering

Two types of factors were selected in this project: risk factors and protective factors. The basic hypothesis about handgun violations is that this type of crime tends to happen in blighted areas with many abandoned houses, abandoned cars, or empty lots, because those features can shield crime from detection. Thus, we select several service records from the 311 Service Requests on the Chicago Data Portal to capture the built-environment features of blighted areas. Alcohol is another important trigger for crime, so we select the density of liquor retail as a risk factor. In addition, many American cities, including Chicago, have long suffered from "center decline": although there are many commercial and shopping places in the city center, it is also a crime-prone area at night. Thus, distance to "the Loop" city center was also selected as a risk factor. We also use protective factors such as police stations, schools, and Broadband Innovation Zones, which are likely to provide more "eyes on the street" and prevent potential crime. The potential features can be summarized as:

  • AlleyLightsOut: reports of alleys without lights
  • AbandonBuildings: open and vacant buildings
  • RodentBaiting: all open rodent baiting requests and rat complaints
  • Sanitation: reports of violations of Chicago's sanitation code, e.g. overflowing dumpsters and garbage in the alley
  • graffiti: graffiti removal requests
  • abandonCars: open abandoned-vehicle complaints
  • treeDebris: all open tree debris removal requests
  • liquorRetail: retail sales of packaged liquor
  • distance to the Loop (city center)
  • PoliceStation: police station locations (protective)
  • school: school locations (protective)
  • Broadband_Innovation_Zones (protective)

AlleyLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Alley-Lights-Out-Historical/t28b-ys7j/data") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Alley_Lights_Out")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")


sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ == "Front" |
           where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Cars")



liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")


PoliceStation <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv/data")%>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "PoliceStation")
head(PoliceStation)

treeDebris <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Tree-Debris-Historical/mab8-y9h3/data")%>%
dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "treeDebris")

RodentBaiting <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-No-Duplicates/uqhs-j723")%>%
  mutate(year = substr(Completion.Date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Y = Latitude, X = Longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Rodent_Baiting")


bussinessNoL <- 
   read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q")%>%
  filter(CITY == "CHICAGO")%>%
  filter(!str_detect(BUSINESS.ACTIVITY, "Liquor"))%>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
              na.omit() %>%
              st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
              st_transform(st_crs(fishnet)) %>%
              mutate(Legend = "bussiness_without_Liquor")

school <- st_read("https://data.cityofchicago.org/api/geospatial/9zky-nrsy?method=export&format=GeoJSON")%>%
  dplyr::select(Y = lat, X = long) %>%
              na.omit() %>%
              st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
              st_transform(st_crs(fishnet)) %>%
              mutate(Legend = "school")%>%
  dplyr::select( geometry,Legend) 



neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 
  
# The Loop, Chicago’s downtown
neighborhoods_Loop <-
  neighborhoods %>%
  filter(name == "Loop")%>%
  mutate(Legend = "City_center_Loop")%>%
   dplyr::select(Legend)


Broadband_Innovation_Zones <- 
  st_read("https://data.cityofchicago.org/api/geospatial/jkfs-hwcr?method=export&format=GeoJSON") %>%
  st_transform(st_crs(fishnet)) %>%
   mutate(Legend = "Broadband_Innovation_Zones")


BIZPoint <-
  Broadband_Innovation_Zones %>%
  st_centroid()

Broadband_Innovation_Zones <- Broadband_Innovation_Zones%>%
  dplyr::select(Legend)

vars_net <- 
  rbind(AlleyLightsOut, abandonBuildings, sanitation, graffiti, abandonCars, liquorRetail, treeDebris, RodentBaiting, bussinessNoL, PoliceStation, school,Broadband_Innovation_Zones, neighborhoods_Loop) %>%
  st_join(., fishnet, join=st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit()
ggplot() +
  geom_sf(data=chicagoBoundary) +
  geom_sf(data = rbind(AlleyLightsOut, abandonBuildings, sanitation, graffiti, abandonCars,treeDebris,RodentBaiting,bussinessNoL,
                       liquorRetail, school, PoliceStation, neighborhoods_Loop,Broadband_Innovation_Zones),
          size = .1) +
  facet_wrap(~Legend, ncol = 4) +
  labs(title = "Risk Factors") +  
  mapTheme()

3.1 Count of Features

This project provides two methods of feature engineering to extract the most valuable information from the data. The first is to count the number of items in each fishnet cell. The second measures the distance from the center of each grid cell to the nearest items (average nearest neighbor distance), which avoids the downside of using a pre-defined geography (e.g. the fishnet grid) as the unit for aggregating the data.
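The second method, implemented by nn_function below with k = 3, can be stated compactly: for the centroid c_i of cell i and its k nearest feature points p_(1), …, p_(k), the feature value is the mean distance

```latex
d_i = \frac{1}{k}\sum_{j=1}^{k} \lVert c_i - p_{(j)} \rVert
```

so a small d_i means cell i sits close to a cluster of that feature.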

vars_net.count.long <- 
  vars_net %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.count.long$Variable)
vars.count <- vars[c(1:3,5,7:13)]

mapList <- list()

for(i in vars.count){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.count.long, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(option="cividis",name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol =4, top = "Count of Potential Features by Fishnet"))

3.2 Nearest Neighbor Distance

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  
  return(output)  
}

# measure the distance from the centroid of each grid cell to the nearest items
vars_net$Abandoned_Buildings.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)
    
vars_net$Abandoned_Cars.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)
    
vars_net$Graffiti.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)
    
vars_net$Liquor_Retail.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)

vars_net$Alley_Lights_Out.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(AlleyLightsOut), 3)
    
vars_net$Sanitation.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)

vars_net$TreeDebris.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(treeDebris), 3)

vars_net$RodentBaiting.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(RodentBaiting), 3)

vars_net$BussinessNoL.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(bussinessNoL), 3)

vars_net$PoliceStation.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(PoliceStation), 3)

vars_net$school.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(school), 3)

vars_net$BIZ.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(BIZPoint), 3)
vars_net.long.nn <- 
  vars_net %>%
  dplyr::select(ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(option="cividis") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol =4, top = "Nearest Neighbor Factors by Fishnet"))

3.3. Euclidean distance

Since, as mentioned above, "city center decline" might also contribute to crime, we measure the Euclidean distance from the centroid of the Loop (downtown) to each grid cell's center point. The Loop in this project is defined by the neighborhood data.

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 
  
# The Loop, Chicago’s downtown
loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

# measure distance to downtown
vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

ggplot() +
  geom_sf(data=vars_net, aes(fill=loopDistance)) +
  scale_fill_viridis(option="cividis") +
  labs(title="Euclidean distance to The Loop") +
  mapTheme() 

#final_net = vars_net + crime_net
final_net <-
  left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID") 

#add neighbourhood name and policeDistrict into final_net 
final_net <-
  st_centroid(final_net) %>%
    st_join(., dplyr::select(neighborhoods, name)) %>%
    st_join(., dplyr::select(policeDistricts, District)) %>%
    st_join(., dplyr::select(policeBeats, District)) %>%
      st_set_geometry(NULL) %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

final_net.long.nn <- 
  final_net %>%
  dplyr::select(Abandoned_Buildings.nn, Abandoned_Cars.nn, Graffiti, Liquor_Retail,
               Alley_Lights_Out, Sanitation, loopDistance, treeDebris,
                RodentBaiting.nn, BIZ.nn, BussinessNoL.nn, school.nn, PoliceStation) %>%
  gather(AllFeature, value, -geometry)

final_net_nn <- unique(final_net.long.nn$AllFeature)
mapList <- list()

4. Exploring the Spatial Structure

4.1. Local Moran’s I

To take the spatial distribution of handgun violations into consideration, we use Local Moran's I to measure whether a given cell's count of handgun violations is related to its surrounding cells. A higher Ii implies that the cell belongs to a cluster of high handgun violation counts, i.e. a handgun violation hotspot.

According to the Local Moran's I and p-value maps, we can derive a statistically significant handgun violation hotspot map. The figure below shows each fishnet cell's distance to the nearest significant hotspot, which is the most important spatial structure variable added to our model.
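For reference, Local Moran's I for cell i follows Anselin's (1995) formulation (up to the exact scaling used by spdep::localmoran), where x_i is the cell count, w_ij the spatial weights, and x̄ the mean count:

```latex
I_i = \frac{x_i - \bar{x}}{m_2} \sum_{j} w_{ij}\,(x_j - \bar{x}),
\qquad m_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

A large positive I_i with a small p-value marks a cell that is both high-count itself and surrounded by high-count neighbors.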

final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

set.ZeroPolicyOption(TRUE)
final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countWeapons, final_net.weights)),
    as.data.frame(final_net, NULL)) %>% 
    st_sf() %>%
    dplyr::select(Handgun_Violation_Count = countWeapons, 
                  Local_Morans_I = Ii, 
                  P_Value = `Pr(z != E(Ii))`) %>%
    mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
    gather(Variable, Value, -geometry)
  
#loop for map
vars_localM <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars_localM){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
      scale_fill_viridis(option="cividis") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(varList, ncol = 2, top = "Local Morans'I Statistics of Handgun Violation"))

final_net <-
  final_net %>% 
  mutate(handgun.isSig = ifelse(localmoran(final_net$countWeapons, 
                                            final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(handgun.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                           st_coordinates(st_centroid(
                                             filter(final_net, handgun.isSig == 1))), 1 ))

ggplot() + 
  geom_sf(data = final_net, aes(fill = handgun.isSig.dist),lwd=0.1) +
  scale_fill_viridis(option="cividis") +
  labs(title = "Distance to Highly Significant Handgun Violation Hotspots") +
  mapTheme()

4.2. Correlations Test

The following correlation plots show the relationship between the selected features (risk & protective) and the number of handgun violations. Overall, most of the selected features have a strong correlation with the number of handgun violations. It is also obvious that the "count in each grid cell" and "nearest neighbor distance" versions of a feature are collinear, so for the final selection, the one with the weaker correlation will be removed from the model.

correlation.long <-
  st_set_geometry(final_net, NULL) %>%
  dplyr::select(c(1,4:29), -c("PoliceStation","City_center_Loop","Broadband_Innovation_Zones"))%>%
    #dplyr::select(countWeapons, Abandoned_Buildings.nn, Abandoned_Cars.nn, Graffiti, Liquor_Retail, Alley_Lights_Out.nn, Sanitation.nn, loopDistance, TreeDebris.nn, RodentBaiting.nn, BIZ.nn, BussinessNoL.nn, school.nn, PoliceStation) %>%
    gather(Variable, Value, -countWeapons )


correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countWeapons, use = "complete.obs"))




ggplot(correlation.long, aes(Value, countWeapons)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "palevioletred1", lwd=1) +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Handgun Violation count as a function of risk or protective factors")

ggplot(final_net, aes(countWeapons)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of Handgun Violation by grid cell")

5. Cross Validation

5.1. Cross-validated Poisson Regression

We use four types of cross-validation to check the errors:

    1. random k-fold, risk factors only
    2. random k-fold, risk factors + spatial structure (distance to the handgun violation hotspots)
    3. LOGO-CV (leave-one-group-out cross-validation), risk factors only
    4. LOGO-CV (leave-one-group-out cross-validation), risk factors + spatial structure (distance to the handgun violation hotspots)
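Each model fit inside the folds is a Poisson regression of the cell counts on the selected features, i.e. for cell i with feature vector x_i:

```latex
y_i \sim \mathrm{Poisson}(\lambda_i), \qquad
\log \lambda_i = \beta_0 + \sum_{k} \beta_k x_{ik}
```

which is appropriate for non-negative count outcomes like the per-cell handgun violation counts.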

reg.vars <- c("Abandoned_Buildings.nn" , "Abandoned_Cars.nn" , "Graffiti" ,  "Liquor_Retail" , "Alley_Lights_Out.nn" , "Sanitation.nn" , "loopDistance" , "TreeDebris.nn" , "RodentBaiting.nn" , "BIZ.nn", "BussinessNoL.nn" , "school.nn" , "PoliceStation" , "school" , "PoliceStation.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn" , "Abandoned_Cars.nn" , "Graffiti" ,  "Liquor_Retail" , "Alley_Lights_Out.nn" , "Sanitation.nn" , "loopDistance" , "TreeDebris.nn" , "RodentBaiting.nn" , "BIZ.nn", "BussinessNoL.nn" , "handgun.isSig" , "handgun.isSig.dist" , "school.nn" , "PoliceStation" , "school" , "PoliceStation.nn")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {

allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])

for (i in cvID_list) {

  thisFold <- i
#  cat("This hold out fold is", thisFold, "\n")

  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  
  regression <-
    glm(countWeapons ~ ., family = "poisson", 
      data = fold.train %>% 
      dplyr::select(-geometry, -id))
  
  thisPrediction <- 
    mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
  allPredictions <-
    rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}
reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countWeapons",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countWeapons, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countWeapons",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countWeapons, Prediction, geometry)
  
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countWeapons",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countWeapons, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countWeapons",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countWeapons, Prediction, geometry)

5.2. Accuracy

Comparing the four sets of predictions with the observed handgun violations, we find that the models with spatial structure variables predict better, while the difference between the k-fold and LOGO cross-validation methods is less obvious in this study (see the error comparison below).

reg.summary <- 
  rbind(
    mutate(reg.cv, Error = countWeapons - Prediction,
                   Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv, Error = countWeapons - Prediction,
                      Regression = "Random k-fold CV: Spatial Structure"),
    
    mutate(reg.spatialCV, Error = countWeapons - Prediction,
                          Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = countWeapons - Prediction,
                             Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
    st_sf() 


grid.arrange(
  reg.summary %>%
    ggplot() +
      geom_sf(aes(fill = Prediction), lwd=0.1) +
      facet_wrap(~Regression) +
      scale_fill_viridis(option="cividis") +
      labs(title = "Predicted Handgun Violation by Regression", 
           subtitle = "Figure 1") +
      mapTheme() + theme(legend.position="bottom"),

  filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
    ggplot() +
      geom_sf(aes(fill = countWeapons), lwd=0.1) +
      scale_fill_viridis(option="cividis") +
      labs(title = "Observed Handgun Violation\n", 
           subtitle = "Figure 1") +
      mapTheme() + theme(legend.position="bottom"), ncol = 2)

filter(reg.summary) %>%
  ggplot() +
    geom_sf(aes(fill = Error), lwd=0.1) +
    facet_wrap(~Regression) +
    scale_fill_viridis(option="cividis") +
    labs(title = "Handgun Violation errors by Regression") +
    mapTheme()

This table shows that random k-fold CV with spatial structure variables is the most suitable method for this study, since it has the smallest mean absolute error and standard deviation.

st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>% 
  summarize(MAE = round(mean(abs(Prediction - countWeapons), na.rm = T),2),
            SD_MAE = round(sd(abs(Prediction - countWeapons), na.rm = T),2)) %>% 
  kable(caption = "Table: MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "palevioletred1") %>%
    row_spec(4, color = "black", background = "palevioletred1") 
Table: MAE by regression

Regression                            MAE    SD_MAE
Random k-fold CV: Just Risk Factors   1.26   1.96
Random k-fold CV: Spatial Structure   1.19   1.87
Spatial LOGO-CV: Just Risk Factors    1.29   2.02
Spatial LOGO-CV: Spatial Structure    1.24   1.99
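For clarity, MAE here is the mean absolute error over the n cells of each regression, and SD_MAE is the standard deviation of the same absolute errors:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,
\qquad
\mathrm{SD\_MAE} = \mathrm{sd}\!\left( \left| \hat{y}_i - y_i \right| \right)
```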

5.3. Generalizability by race

We use race to check the generalizability of the model. The results show that the model does not generalize equally well across racial contexts, although the difference drops noticeably when we add the spatial structure variables. All models over-predict in majority non-white communities, meaning that the predicted count of handgun violations in majority non-white neighborhoods is larger than observed, and all models under-predict in majority white communities. However, this difference may be unavoidable, since the bias is embedded in the crime dataset itself.

#census_api_key("3929725aba6e199105d8e902356ad2fb95af811d",install=TRUE,overwrite=TRUE)
#readRenviron("~/.Renviron")
tracts17 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2017, state=17, county=031, geometry=T) %>%
  st_transform("ESRI:102271")  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
        NumberWhites = B01001A_001) %>%
 mutate(percentWhite = NumberWhites / TotalPop,
        raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

final_reg <- 
  filter(reg.summary) %>%
  mutate(uniqueID = rownames(.))

final_reg.tracts <- 
  st_join(st_centroid(final_reg), tracts17) %>%
    st_set_geometry(NULL) %>%
    left_join(dplyr::select(final_reg, uniqueID)) %>%
    st_sf() %>%
    na.omit()

st_set_geometry(final_reg.tracts, NULL) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean Error by neighborhood racial context") %>%
    kable_styling("striped", full_width = F) 

6. Traditional method & our model

Comparing the traditional hotspot map (kernel density) with our prediction hotspots (Random k-fold CV: Spatial Structure), we find that the high-risk areas in our model are much more detailed than in the traditional one, which can help police allocate resources more precisely.
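The kernel density surface computed by density.ppp below is, ignoring edge correction, a sum of isotropic Gaussian kernels centered at the n incident locations s_i, with standard deviation σ = 1000 m (the second argument):

```latex
\hat{\lambda}(s) = \sum_{i=1}^{n} \frac{1}{2\pi\sigma^2}
\exp\!\left( -\frac{\lVert s - s_i \rVert^2}{2\sigma^2} \right)
```

Unlike the regression, this estimate uses only past incident locations, so it inherits any reporting bias in the data directly.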

# Compute kernel density
weapons_ppp = as.ppp(st_coordinates(weapons), W = st_bbox(final_net))
weapons_KD = density.ppp(weapons_ppp, 1000)

# Convert kernel density to grid cells taking the mean
weapon_KDE_sf <- as.data.frame(weapons_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fishnet.
  cbind(
    aggregate(
      dplyr::select(weapons) %>% mutate(weaponCount = 1), ., length) %>%
    mutate(weaponCount = replace_na(weaponCount, 0))) %>%
#Select the fields we need
  dplyr::select(label, Risk_Category, weaponCount)


weapon_risk_sf <-
  filter(final_reg, Regression == "Random k-fold CV: Spatial Structure") %>%
   st_as_sf(crs = st_crs(weapon_KDE_sf)) %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(weapons) %>% mutate(weaponCount = 1), ., length) %>%
      mutate(weaponCount = replace_na(weaponCount, 0))) %>%
  dplyr::select(label,Risk_Category, weaponCount)

rbind(weapon_KDE_sf, weapon_risk_sf) %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(weapons, 1500), size = .1, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(option="cividis", discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="Relative to test set points (in black)") +
    mapTheme()

This part calculates each model's rate of predicted crime to observed crime by risk category. Although for the 1% to 89% risk categories the traditional model sometimes performs better than ours, our model captures more crime in the highest-risk category (90%-100%).

rbind(weapon_KDE_sf, weapon_risk_sf) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countWeapons = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countWeapons / sum(countWeapons)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE)+
  labs(title = "Risk prediction vs. Kernel density, 2017 Handgun Violations") 

7. Conclusion

Generally, this algorithm uses handgun violation crime data together with additional objective risk and protective factors to predict future potential crime. I would recommend this algorithm since it predicts better than the traditional method, tries to mitigate the impact of data bias, and aims to be accurate and general across communities of different races. For further improvement, more relevant independent variables should be included in the model to produce better predictions.